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: Asa World Health Organization Research and Development Blueprint priority pathogen, there is a need 

: to better understand the geographic distribution of Middle East Respiratory Syndrome Coronavirus 

: (MERS-CoV) and its potential to infect mammals and humans. This database documents cases of MERS- 
CoV globally, with specific attention paid to zoonotic transmission. An initial literature search was 
conducted in PubMed, Web of Science, and Scopus; after screening articles according to the inclusion/ 
exclusion criteria, a total of 208 sources were selected for extraction and geo-positioning. Each MERS- 
CoV occurrence was assigned one of the following classifications based upon published contextual 
information: index, unspecified, secondary, mammal, environmental, or imported. In total, this 
database is comprised of 861 unique geo-positioned MERS-CoV occurrences. The purpose of this article 
is to share a collated MERS-CoV database and extraction protocol that can be utilized in future mapping 

: efforts for both MERS-CoV and other infectious diseases. More broadly, it may also provide useful data 

: forthe development of targeted MERS-CoV surveillance, which would prove invaluable in preventing 

: future zoonotic spillover. 


: Background & Summary 

: Middle East Respiratory Syndrome Coronavirus (MERS-CoV) emerged as a global health concern in 2012 
: when the first human case was documented in Saudi Arabia!. Now listed as one of the WHO Research and 
: Development Blueprint priority pathogens, cases have been reported in 27 countries across four continents’. 
: Imported cases into non-endemic countries such as France, Great Britain, the United States, and South Korea 
: have caused secondary cases’, thus highlighting the potential for MERS-CoV to spread far beyond the countries 
: where index cases originate. Reports in animals suggest that viral circulation could be far more widespread than 
: suggested by human cases alone. 

: To help prevent future incidence of MERS-CoV, public health officials can focus on mitigating zoonotic trans- 
: fer; however, in order to do this effectively, additional research is needed to determine where spillover could occur 
: between mammals and humans. Previous literature reviews have looked at healthcare-associated outbreaks’, 
: importation events resulting in secondary cases!™!!, occurrences among dromedary camels’, or to summa- 
: rize current knowledge and knowledge gaps of MERS-CoV'*»». This database seeks fill gaps in literature and 
: build upon existing notification data by enhancing the geographic resolution of MERS-CoV data and providing 
: occurrences of both mammal and environmental detections in addition to human cases. This information can 
: help inform epidemiological models and targeted disease surveillance, both of which play important roles in 
: strengthening global health security. Knowledge of the geographic extent of disease transmission allows stake- 
: holders to develop appropriate emergency response and preparedness activities (https://www.jeealliance.org/ 
: global-health-security-and-ihr-implementation/joint-external-evaluation-jee/), inform policy for livestock trade 
: and quarantine, determine appropriate demand for future vaccines (http://cepi.net/mission) and decide where 
: to deliver them. Additionally, targeted disease surveillance will provide healthcare workers with updated lists of 
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Fig. 1 Middle East Respiratory Syndrome Coronavirus (MERS-CoV) literature extraction flowchart. Process of 
data source selection from initial literature search to extraction. 


at-risk countries. Patients with a history of travel to affected regions could then be rapidly isolated and treated, 
thus reducing risk of nosocomial transmission. 

This database is comprised of 861 unique geo-positioned MERS-CoV occurrences extracted from reports pub- 
lished between October 2012 and February 2018. It systematically captures unique occurrences of MERS-CoV 
globally by geo-tagging published reports of MERS-CoV cases and detections. Data collection, database creation, 
and geo-tagging methods are described below. Instructions on how to access the database are provided as well, 
with the aim to contribute to future epidemiological analysis. All data is available from the Global Health Data 
Exchange’® and Figshare’”. 


Methods 

The methods and protocols summarized below have been adapted from previously published literature extraction 
processes'*-”, and provide additional context surrounding our systematic data collection from published reports 
of MERS-CoV. 


Data collection. We identified published reports of MERS-CoV by searching PubMed, Web of Science, and 
Scopus with the following terms: “Middle Eastern Respiratory Syndrome’, “Middle East Respiratory Syndrome’, 
“MERSCoV”, and “MERS”. The initial search was for all articles published about MERS-CoV prior to April 30, 
2017, and was subsequently updated to February 22, 2018. These searches were conducted through the University 
of Washington Libraries’ institutional database subscriptions. We searched the Web of Science Web of Science 
Core Collection (the subscribed edition includes Science Citation Index Expanded, 1900-present; Social Sciences 
Citation Index, 1975-present; Arts & Humanities Citation Index, 1975-present; Emerging Sources Citation Index, 
2015-present). We searched the standard Scopus database and the standard, freely available PubMed database; 
these products have a single version that is consistent across institutional subscriptions or access points. 

In total, this search returned 7,301 related abstracts, which were collated into a database before a title-abstract 
screening was manually conducted (Fig. 1. Flowchart). Articles were removed if they did not contain an occur- 
rence of MERS-CoV; for example, vaccine development research or coronavirus proteomic analyses. Non-English 
articles were flagged for further review and brought into the full text screening stage. The accompanying supple- 
mentary file highlight the title and abstract screening process and the inclusion and exclusion criteria. 

Full text review was conducted on 1,083 sources. To meet the inclusion criteria, articles must have contained 
both of the following items: 1) a detection of MERS-CoV from humans, animals, or environmental sources, and 
2) MERS-CoV occurrences tagged with spatial information. Additionally, extractors attempted to prospectively 
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Index 34 99 1 93 7 234 
Unspecified 86 50 1 4 35 27 203 
Mammal 53 56 7 30 43 19 208 
Import 11 2 0 2 10 9 34 
Secondary 82 30 1 1 26 8 148 
Absent 3 8 0 0 7 3 21 
Environmental 0 1 0 0 0 2 3 
MERS-CoV-like 1 1 7 0 0 1 10 


Table 1. Middle East Respiratory Syndrome Coronavirus (MERS-CoV) occurrences by patient type and 
geographic precision. 


manually remove articles containing duplicate occurrences that were already extracted in the dataset. Extractors 
only prospectively manually removed articles if it was clear the articles contained data we were confident had 
already been extracted and had high-quality data. We excluded 885 sources based on full text review. In addi- 
tion, we reviewed citations and retroactively added relevant articles to our database if they were not already 
included. We retroactively added and subsequently marked ten articles for extraction using this process. In total, 
we extracted 208 peer-reviewed sources reporting detection of MERS-CoV that included geographic and relevant 
epidemiological metadata. 


Geo-positioning of data. Google Maps or ArcGIS” was used to manually extract location information 
at the highest resolution available from individual articles. We evaluated spatial information as either points or 
polygons. The geography was defined as a point if the location of transmission was reported to have occurred 
within a5 x 5km area. Point data are represented by a specific latitude and longitude. A point references an area 
smaller than 5 x 5km in order to be compatible with the typical 5 x 5km resolution of satellite imagery used for 
global analyses. 

The geography was defined as a polygon if the location of transmission was less clear, but known to have 
occurred in a general area (e.g. a province), or the location of transmission occurred within an area greater than 
5 x 5km (e.g. a large city). We used contextual information to determine location in instances where the author’s 
spelling of a location differed from Google Maps or ArcGIS. Maps provided by authors were digitized using 
ArcGIS. 

We used three different types of polygons: known administrative boundaries, buffers, and custom polygons. 
Relevant administrative units were sourced from the Global Administrative Unit Layers curated by the Food and 
Agricultural Organization of the UN™ for known administrative boundaries of governorates, districts, or regions, 
and paired with the occurrence record. Buffers were created to encompass areas in cities and regions without cor- 
responding administrative units. To ensure that buffers encompassed the entirety of the area of interest, Google 
Maps was used to determine the required radius. In areas with unspecified boundaries (e.g. Table Mountain 
National Park and the border region between Saudi Arabia and UAE) ArcGIS was used to generate custom poly- 
gons, which were assigned a unique code within a defined shapefile for ease of re-identification. 


Data Records 

This database is publicly available online'®'’. Each of the 861 rows represents a unique occurrence of MERS-CoV. 
Rows containing an index, unspecified, or imported case represent a single case of MERS-CoV. Rows containing 
mammal and secondary cases may represent more than one case but are still unique geospatial occurrences. 
Table 1 shows an overview of the content available in the publicly available dataset. In addition, online-only 
Table 1 lists occurrences by geography, origin, 405 shape type, and publication and online-only Table 2 provides 
citations of the data. 


16,17 


nid: A unique identifier assigned to each publication that was extracted 

title: Title of the publication 

author: Article’s author(s). 

doi: Articles DOI. 

abstract: Articles abstract, if available. 

source_title: Journal in which the article was published. 

year: Article’s publication year. 

source: Database where article was found. 

pmid_if_applicable: PMID if the article is from PubMed. 

10. full_text_link_if_included: Link to the full text, if available. 

11. file_id: Reference to pdf in format FirstAuthor_Year (e.g. Smith_2017). 

12. occ_id: Unique identifier assigned to each occurrence of MERS-CoV. A single pdf may represent more 
than one occurrence. Each row will have its own occ_id, starting at 1 and numbered consecutively to 883. 

13. organism_type: What type of organism tested positive for MERS-CoV (human, mammal, or 
environmental). 

14. organism_specific: Specifies the exact organism that tested positive for MERS-CoV. Names are made 

consistent with Wilson and Reeder (2005) Mammal Species of the World”. 


SMAUM PR wWh p 
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15. pathogen: Name the pathogen identified (e.g. MERS-CoV, Bat Coronaviruses, and other MERS-CoV-like 
pathogens). 

16. pathogen_note: Miscellaneous notes regarding pathogen. 

17. patient_type: index, unspecified, NA, secondary, import, or absent. 


e index: Any human infection of MERS-CoV resulting after direct contact with an animal and no reported 
contact with a confirmed MERS-CoV case or healthcare setting. 

e unspecified: Cases that lacked sufficient epidemiological evidence to classify them as any other status (e.g. 
serosurvey studies). 

e NA: Non-applicable field; case was not a patient (e.g. mammal) 

e secondary: Defined as any cases resulting from contact with known human infections. Cases reported after 
the index case can be assumed to be secondary cases unless accompanied by specific details of likely inde- 
pendent exposure to an animal reservoir. 

e import: Cases that were brought into a non-endemic country after transmission occurred elsewhere. 

e absent: Suspected case(s) ultimately confirmed negative for MERS-CoV. 


18. transmission_route: zoonotic, direct, unspecified, or animal-to-animal. 


e zoonotic: Transmission occurred from an animal to a human. 

e direct: Only relevant for human-to-human transmission. 

e unspecified: Lacked sufficient epidemiological evidence to classify a human case as zoonotic or direct. 
e animal-to-animal: Transmission occurred from an animal to another animal. 


19. clinical: Describes whether the MERS-CoV occurrence demonstrated clinical signs of infection. Denoted 
by yes, no, or unknown. 


e yes: Clinical signs of infection were present/reported. Clinical signs among humans may range from mild (e.g. 
fever, cough) to severe (e.g. pneumonia, kidney failure). Clinical signs among camels include nasal discharge. 

e no: Clinical signs of infection were not present/reported. 

e unknown: Subject(s) may or may not have been demonstrating clinical signs of infection. For example, some 
authors did not explicitly mention symptoms, but individuals reportedly sought medical care. Another exam- 
ple being when a diagnostic serosurvey was conducted during an ongoing outbreak. The term “unknown” was 
used when articles lacked sufficient evidence for extractors to definitively label as “yes” or “no”. 


20. diagnostic: Describes the class of diagnostic method that was used. PCR, serology, or reported. 

21. diagnostic_note: More detailed information related to the specific test used (e.g. rk39, IgG, or IgM 
serology). 

22. serosurvey: Describes the context if serological testing was used. 


e diagnostic: testing of symptomatic patients. 
e exploratory: historic exposure determined among healthy asymptomatic individuals. 


23. country: ISO3 code for country in which the case occurred. 

24. origin: Open-ended field to provide more details on the specific in-country location of MERS-CoV case. 

25. problem_geography: This field was utilized if the MERS-CoV case was reported in a location that could 
cause uncertainty when determining exact geographic occurrence (e.g. hospital, abattoir). 

26. lat: Latitude measured in decimal degrees. 

27. long: Longitude measured in decimal degrees. 

28. latlong_source: The source from which latitude and longitude were derived. 

29. loc_confidence: States the level of confidence that researchers had when assigning a geographic location 
to the MERS-CoV case (good or bad). An answer of ‘good’ meant the article stated clearly that the case 
occurred in a specific geographic location and no assumptions were required on part of the researcher. An 
answer of ‘bad’ meant the article did not clearly state the specific geographic location of the MERS-CoV 
case, but the researcher was able to infer the location of occurrence. The field SITE_NOTES was utilized to 
detail the logic behind researchers’ decisions when inference was required. 

30. shape_type: The geographic shape type assigned to the MERS-CoV occurrence (point or polygon). 

31. poly_type: If the MERS-CoV occurrence was assigned a shape_type of polygon, was it admin (GAUL), 
custom, or buffer? 

32. buffer_radius: If a MERS-CoV occurrence was assigned a buffer, what is the radius in km? 

33. gaul_year_or_custom_shapefile: File path used to reach the necessary shape file in ArcGIS. Users of this 
dataset can find custom shapefiles created for this dataset at: https://cloud.inme.washington.edu/index. 
php/s/DGoyKYqnbjG54F2/download 

34. poly_id: A standardized and unique identifier assigned to each GAUL shapefile. 

35. poly_field: Which type of polygon was used to geo-position the occurrence? (e.g. if admin1 polygon was 
used, enter ADM1_CODE) 

36. site_notes: Miscellaneous notes regarding the site of occurrence. 

37. month_start: Month that the occurrence(s) began. If the article provided a specific month of illness onset, 
the month was assigned a number from 1-12 (1 = January, 2 = February, etc.). If the article did not pro- 
vide a specific month of illness onset, then researchers assigned a value of ‘NA’ 
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Fig.2 Geographic distribution of published detection of Middle East Respiratory Syndrome Coronavirus 
(MERS-CoV). Occurrences are layered from top to bottom in the following order: Index (green), Unspecified 
(orange), Mammal (yellow), Import (blue), Secondary (purple). Points were plotted using their assigned 
latitudes and longitudes, and shape files were created for polygons. Buffers were also plotted using assigned 
latitudes and longitudes, after which each buffer’s custom radius was drawn. Higher resolution geographies 
(points, buffers, governorates) were plotted on top of lower resolution geographies (countries, regions). 


38. month_end: Month that the occurrence(s) ended, defined as the date a patient tested negative for MERS- 
CoV. If the article provided a specific month for recovery, the month was assigned a number from 1-12 
(1 = January, 2 = February, etc.). If the article did not provide a specific month of symptom onset, then 
researchers assigned a value of NA. 

39. year_start: Year that the occurrence(s) began. If the year of illness onset was not provided in the article, the 
IHME standard was used: 
(year_start = publication year - 3). 

40. year_end: Year that the occurrence(s) ended. If the article did not provide a specific year for recovery, the 
IHME standard was used: 
(year_end = publication year - 1). 

41. year_accuracy: If years were reported, this field was assigned a value of ‘0. If assumptions were required, 
this field was assigned a value of ‘1’. 


Figures 2-4 show the geographic distribution of the MERS-CoV occurrence database, with distinctions made 
by epidemiological and geographic metadata. 


Technical Validation 

All data extracted from the original search (October 2012 to April 30, 2017) was reviewed independently by a sec- 
ond individual to check for accuracy. Challenging extractions from the updated search (May 1, 2017 to February 
22, 2018) were selected for group review during bi-weekly team meetings. Upon extraction completion, all data 
were checked to ensure they fell on land and within the correct country. 

While the protocol implemented above was designed to reduce the amount of subjective decisions made by 
extractors, total elimination was not possible. Wherever a subjective decision had to be made, the extractor uti- 
lized the various notes fields in order to document the logic behind decisions. These decisions were subsequently 
reviewed by other extractors. 


Usage Notes 

The techniques described here can be applied to collect and curate datasets for other infectious diseases, as has 
been previously demonstrated with dengue” and leishmaniasis'®. Additionally, since these data were collected 
independently through published reports of MERS-CoV occurrence, they may be used to build upon existing 
notification data**”’. Our ability to capture occurrences in this dataset is contingent on the data contained within 
published literature. Therefore, this dataset does not represent a total count of all cases. Instead, this dataset’s 
value lies within its geo-precision. Data were extracted with a focus on obtaining the highest resolution possible. 
These data may be merged with other datasets, such as WHO” or OIE” surveillance records, and are intended to 
complement, not replace, these resources. Together, published reports and notification data can provide a more 
comprehensive snapshot of current disease extent and at-risk locations. 

An important consideration, whether using the literature data alone, or in combination with other databases, 
is the potential for duplication. Various pieces of metadata can be used to evaluate where potential duplicates 
could lie, such as common date fields (month_start, month_end, year_start, year_end) or consistent geographic 
details (lat, long, poly_id, shape_type) or shared epidemiological tags (patient_type). Researchers may wish to 
consider further steps, such as fuzzy matching of geographic data (e.g. matching a point with an overlapping 
buffer) or temporal data (e.g. matching a precise month with an overlapping month interval). We acknowledge 
this duplicate-removal process will not catch all matching records, but it will likely catch several. We recommend 


SCIENTIFIC DATA | (2019) 6:318 | https://doi.org/10.1038/s41597-019-0330-0 5 


www.nature.com/scientificdata/ 


Fig. 3 Geographic distribution of detections of Middle East Respiratory Syndrome Coronavirus (MERS-CoV) 
in mammals. Mammal populations testing positive for MERS-CoV primarily consisted of camels but also 
included a sheep, hamadryas baboon, Egyptian tomb bat, and an alpaca. Points were plotted using their assigned 
latitudes and longitudes, and shape files were created for polygons. Buffers were also plotted using assigned 
latitudes and longitudes, after which each buffer’s custom radius was drawn. Higher resolution geographies 
(points, buffers, governorates) were plotted on top of lower resolution geographies (countries, regions). 
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Fig. 4 Geographic distribution of detections of Middle East Respiratory Syndrome Coronavirus (MERS-CoV) 
among cases tagged as Index or unspecified. Occurrences tagged as Index are coloured green, those tagged as 
unspecified are coloured orange. Points were plotted using their assigned latitudes and longitudes, and shape 
files were created for polygons. Buffers were also plotted using assigned latitudes and longitudes, after which 
each buffer’s custom radius was drawn. Higher resolution geographies (points, buffers, governorates) were 
plotted on top of lower resolution geographies (countries, regions). 


this approach because it will allow researchers to remove several duplicates without erroneously deleting any two 
occurrences that are truly unique (i.e. not duplicates). Essentially, we recommend a sensitive approach above a 
more specific approach, as the latter simply risks culling too many records that arent actually duplicates. 

When merging with other databases, consistency in metadata tagging is essential. For the WHO Disease 
Outbreak News data feed**”’ for instance, nomenclature for case definitions is slightly different, with WHO defi- 
nitions of “Community Acquired” and “Not Reported” comparable to “Index” and “Unspecified” respectively. In 
addition, it is important to recognize what information is beyond the scope of these additional databases. Again, 
when comparing to the WHO dataset, it is important to recognize that serologically positive cases do not meet the 
case definition used in the WHO database. These adjustments need to be identified on a dataset-to-dataset basis. 
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This database can be combined with other covariates (e.g. satellite imagery) to produce environmental suita- 
bility models of MERS-CoV infection risk and potential spillover on both global and regional scales as achieved 
with other exemplar datasets***!. This information can be useful in resource allocation aimed at improving dis- 
ease surveillance and contribute towards a better understanding of the factors facilitating continued emergence 
of index cases. 

The addition of sampling techniques and prevalence data may improve this dataset. Researchers were ulti- 
mately unable to add these data due to inconsistencies in the way literature reported sampling techniques and 
prevalence date by geography. An attempt to extract these data using the current approach would have led to 
sporadic inclusion of this information and would not have been comprehensive for the entire dataset. Moving 
forward, we recommend authors report sampling technique and prevalence data at the highest resolution geog- 
raphy possible, as seen in Miguel et al.*”. We encourage continued presentation of paired epidemiological and 
geographic metadata that would allow for more detailed analysis in the future. 

This database may also be utilized in clinical settings to provide an evidence-base for diagnoses when used in 
conjunction with patient travel histories. Additionally, it can be used to identify geographies for surveillance, par- 
ticularly areas where MERS-CoV has been documented in animals but not humans (e.g. Ethiopia and Nigeria). 
Identifying locations for surveillance will, in turn, inform global health security. While models will increase the 
resolution at which these questions can be addressed, datasets such as this provide an initial baseline. 

A major limitation of this database is the potential for sampling bias, which stems from higher frequency of 
disease reporting within countries where there exists strong healthcare infrastructure and reporting systems. This 
database does not attempt to account for such biases, which must be addressed in subsequent modelling activities 
where such biases are of consequence. Similarly, another limitation is potential duplicate documentation of sin- 
gular occurrences. This can happen when the same occurrence is assigned different geographies (e.g. point, pol- 
ygon) in multiple publications. Even though extractors made efforts to prospectively manually identify duplicate 
occurrences, this was challenging because the process relied upon papers providing sufficient details for extractors 
to determine a duplicate occurrence (e.g. geography, patient demographics, dates of occurrence, diagnostic meth- 
ods, etc.). However, the majority of papers did not report such details for each occurrence. In those instances, it 
was impossible for extractors to discern whether occurrences may have been duplicates from a previous artic le. 
Even case studies inconsistently reported patient details and demographic information. These are some examples 
of challenges faced by extractors when we attempted to identify duplicates. Without sufficient contextual clues, 
extractors lacked evidence to determine duplicity and thus likely extracted some unique occurrences more than 
once. Despite efforts to remove duplicate occurrences from the database, it is possible that some remain. 

Geographic uncertainty is similarly problematic for analyses such as this. In some cases, polygons, as opposed 
to points, are utilised as a geographic frame of reference, reflecting the uncertainty in geotagging in the articles 
themselves. For some occurrences, there is a strong assumption that the geography listed corresponds to the site 
of infection. While the use of 5km x 5km as the minimum geographical unit allows for some leeway in this pre- 
cision, it is possible that even with the point data (often corresponding to household clusters) these may not map 
directly with true infection sites. This must be considered in any subsequent geospatial analysis. 

Finally, this database represents a time-bounded survey of the literature. While all efforts were made to be 
comprehensive within this period, articles, and therefore data, will continue to be published. Efforts to streamline 
ongoing collection processes are still to be fully realized’. Regardless, we hope that this dataset provides a solid 
baseline for further iteration. 
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